Global protein function prediction in protein-protein interaction networks 
The determination of protein functions is one of the most challenging problems of the post-genomic era. The sequencing of entire genomes and the possibility of accessing gene co-expression patterns have moved attention from the study of single proteins or small complexes to that of the entire proteome [1]. In this context, the search for reliable methods for assigning protein functions is of the utmost importance. Previous approaches to deducing the unknown function of a class of proteins have exploited sequence similarities or clustering of co-regulated genes [2,3], phylogenetic profiles [4], protein-protein interactions [5,6,7,8], and protein complexes [9,10]. We propose to assign functional classes to proteins from their network of physical interactions, by minimizing the number of interacting proteins with different categories. The function assignment is made on a global scale and depends on the entire connectivity pattern of the protein network. Multiple functional assignments are made possible by the existence of multiple equivalent solutions. The method is applied to the protein-protein interaction network of the yeast Saccharomyces cerevisiae [5]. Robustness is tested in the presence of a high percentage of unclassified proteins and under deletion/insertion of interactions.

Two-hybrid experiments, such as that conducted by Uetz et al. [5], make it possible to reconstruct the set of physical binary interactions among the proteins of a given proteome. To introduce our approach, we represent the protein-protein interaction data as a graph whose nodes are the proteins and whose edges connect pairs of interacting proteins [11,12]. The idea we want to implement here is that interacting proteins may belong to at least one common functional class. Knowledge of the functional classification of a subset of the proteins in the network may therefore lead to an educated guess about the functional classification of the remaining, uncharacterized proteins. In principle, each protein should be assigned one or more functional classes drawn from a set of F possible classes. F is the total number of functions considered and depends on the functional classification scheme used: the finer the definition of function in the scheme, the greater the number F. In general, however, the functional class is known only for a subset of proteins, and one faces the problem of assigning a function $\sigma_i$, chosen among all F possible functions, to each unclassified protein i.

FIG. 1: Illustration of the method. Sub-graph of the protein-interaction network of the yeast Saccharomyces cerevisiae. Proteins in gray boxes are unclassified (unknown function), while the others are classified proteins whose functions are given within the brackets and labeled as follows: 1- cell growth; 2- budding, cell polarity and filament formation; 3- pheromone response, mating-type determination, sex-specific proteins; 4- cell cycle checkpoint proteins; 5- cytokinesis; 6- rRNA synthesis; 7- tRNA synthesis; 8- transcriptional control; 9- other transcription activities; 10- other pheromone response activities; 11- stress response; and 12- nuclear organization. If, for each protein of unknown function, we take as a prediction the function that appears most often among its neighbor proteins of known function, we obtain the following classification (from top to bottom): YNL127W (2), YDR200C (3,4,10) and YLR238W (12). Our method, however, also considers the interactions among unclassified proteins. If we iterate the majority rule once more, taking into account the interactions between the three unclassified proteins, we obtain the classification YNL127W (2,4), YDR200C (3,4,10) and YLR238W (12). In this way we find another possible function for YNL127W. This is the spirit of our method: we take advantage of the predictions we are making for proteins of unknown function and apply a global optimization method.

Assigning to an unclassified protein the most common function(s) among its classified interaction partners, as in [7,8] (majority rule assignment), is rather straightforward. The majority rule relies on the empirical evidence that 70-80% of interacting protein pairs share at least one function. In most cases [13], however, few unclassified proteins have more than one interacting protein with known function. Moreover, in these few cases the interacting proteins with known functions usually do not share functions (Fig. 1). From this perspective, the majority rule assignment is rather unsatisfactory, since all links between proteins with unknown function are completely neglected. This amounts to a very partial use of the knowledge contained in the protein-protein interaction network. More importantly, the final configuration of functions assigned to unclassified proteins ought to be consistent with the rules used to determine the functions themselves: an unclassified protein with one or more unclassified partners must be assigned functions consistent with those assigned to its unclassified partners. This points to a method in which unknown proteins influence the majority rule assignment in a self-consistent way once their functions have been selected.
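
As a point of comparison, the majority rule described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `majority_rule`, the dictionaries `neighbors` and `known`, and the toy network are assumptions made for the example, not the data or code used in this work.

```python
from collections import Counter

def majority_rule(neighbors, known, protein):
    """Return the most frequent function(s) among the classified partners of `protein`."""
    counts = Counter()
    for partner in neighbors.get(protein, ()):
        # Unclassified partners are absent from `known` and are simply ignored,
        # which is exactly the limitation discussed in the text.
        for function in known.get(partner, ()):
            counts[function] += 1
    if not counts:
        return set()  # no classified partner: the rule makes no prediction
    best = max(counts.values())
    return {f for f, c in counts.items() if c == best}

# Toy network (hypothetical): U1 and U2 are unclassified, A and B are classified.
neighbors = {"U1": ["A", "B", "U2"], "U2": ["U1"], "A": ["U1"], "B": ["U1"]}
known = {"A": {1, 2}, "B": {2}}

print(majority_rule(neighbors, known, "U1"))
print(majority_rule(neighbors, known, "U2"))
```

In this toy case U2 interacts only with the unclassified protein U1, so the majority rule returns an empty set for it, although a prediction for U1 is available.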

Our functional prediction strategy is based on a global optimization principle: a score, or energy (see the Experimental protocol section for details), is associated with any given assignment (configuration) of functions for the whole set of unclassified proteins. The score is lower for configurations that maximize the presence of the same functional annotation in interacting proteins. The new ingredient is that the contribution of a given functional assignment of an unclassified protein to the total score is computed as the number of classified and unclassified neighbor proteins with that function. Hence, determining the functions of all unclassified proteins in the network becomes a global optimization problem and cannot be solved on the basis of the local environment alone. Finding the optimal function assignment corresponds to determining the minimal score for the whole network. In statistical mechanics, this corresponds to minimizing the energy of a Potts model [14] with non-homogeneous boundary conditions, the latter being represented by the proteins with known function. The resulting computational problem is frustrated, i.e., it is generally impossible to satisfy all the constraints imposed by classified proteins on their interacting, unclassified partners. This leads to a multiplicity of equivalent or nearly equivalent optimal solutions that contain a minimal number of interacting proteins with different functions. The existence of multiple solutions allows the objective assignment of multiple functions to most unclassified proteins (Fig. 1). Depending on the complexity of the underlying graph and on the boundary conditions, the score minimization can be a hard computational task. In instances of this type, simulated annealing (see the Experimental protocol section) is an appropriate tool for obtaining the optimal solutions. The optimization procedure is repeated several times to account for the non-uniqueness of the optimal configurations, and the prediction for the functional classification is finally made by taking, for each unclassified protein, the functions that occurred most often over the whole set of simulated annealing runs.
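
The final step, ranking functions by how often they occur across repeated optimization runs, can be illustrated as follows. This is a minimal sketch under stated assumptions: `rank_functions`, the protein labels P1 and P2, and the four configurations in `runs` are hypothetical and are not results from the yeast network.

```python
from collections import Counter

def rank_functions(configurations, protein, top=3):
    """Rank the functions of `protein` by frequency of occurrence across runs."""
    counts = Counter(config[protein] for config in configurations)
    total = sum(counts.values())
    return [(function, count / total) for function, count in counts.most_common(top)]

# Hypothetical final states of four independent optimization runs.
runs = [
    {"P1": 2, "P2": 4},
    {"P1": 4, "P2": 4},
    {"P1": 2, "P2": 3},
    {"P1": 2, "P2": 10},
]

print(rank_functions(runs, "P1"))
print(rank_functions(runs, "P2"))
```

A protein that ends in different states in different runs (here P1 with functions 2 and 4) receives several candidate functions, each weighted by its frequency of occurrence.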

We have applied our functional-prediction method to the yeast Saccharomyces cerevisiae protein-protein interaction network. The interaction data were obtained from Ref. [7] and contain N = 1826 proteins with E = 2238 identified interactions. The functional classification was obtained from the MIPS database [15]. The finest MIPS classification scheme contains F = 424 functional categories, plus two categories for proteins with no assigned function: 'CLASSIFICATION NOT YET CLEAR-CUT' and 'UNCLASSIFIED PROTEINS'. The data contain n = 441 proteins in these last two categories. We used our global optimization method to obtain functional assignments for all the proteins listed in these two categories. The complete set of functional assignments can be found in Supplementary Table 1. For each unclassified protein we report its degree, i.e., the number of proteins directly connected to it, and up to three of the most probable predicted functions found with our method. We attribute a higher level of certainty to functions with a higher percentage of occurrence.

TABLE I: Global optimization vs majority rule. Comparison of the success rate of the global optimization (GO) method proposed here and the majority rule (MR). To compute the success rate we treat a fraction $f_u = 0.4$ of the classified proteins as unclassified and then make functional predictions for them. The success rate is defined as the probability that the top-ranked predicted function is actually a functional classification of the corresponding protein. Two different levels of functional classification have been used. At the finest level (1) we have taken the finest classification, containing 424 functional categories. At the coarse-grained level (2) we have taken the highest-level classification (metabolism, energy, cell growth and division, etc.), containing 20 functional categories. We show the success rate as a function of the number of interacting partners k (as a reference we also show the number of proteins $n_k$ with k interacting partners). The case k = 1 is not considered, since the MR method has only a trivial implementation in this case. The comparison of the values for k > 2 clearly indicates that the global optimization method performs better, with a higher percentage of correct predictions. Moreover, the coarser the classification, the higher the success rate. Not surprisingly, adopting a coarser classification scheme increases the success rates (last column), since the number of degrees of freedom the method has to deal with is drastically reduced. This, of course, has to be balanced against the accompanying loss of information content in the predictions.

A fundamental issue concerning protein function prediction is the assessment of the method's reliability with respect to incomplete knowledge of the interaction network. In order to establish an upper bound on the predictive power of our method, we ignored the functional classification of a finite fraction $f_u$ of the classified proteins and then measured the rate of successful predictions by comparing the predictions with the real classification. In this way, one obtains a quantitative estimate of the reliability of our predictions as a function of the amount of available information about the network. Fig. 2a shows the percentage of successful predictions as a function of the degree of the proteins for different values of $f_u$, using the finest functional classification scheme available (424 functional classes). For unclassified proteins with degree larger than 2, a correct prediction is made in 60-70% of the cases, even after the loss of a substantial part of the information (up to $f_u = 0.4$) and quite independently of the degree of the protein involved in the prediction. The test for $f_u = 0.4$ can be inspected in Supplementary Table 2. A quantitative account of the better performance of our method with respect to local optimization methods is presented in Table I, where we also report predictions obtained with a coarser classification scheme. These results point to a large and stable statistical predictive power of the method, even with reduced information (a larger number of unclassified proteins).
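
The logic of this self-consistency test can be sketched as follows. The sketch is illustrative only: `success_rate`, the callback `predict`, and the toy annotation table are hypothetical names introduced here, and the sketch returns a single overall rate rather than the rate resolved by degree k reported in Fig. 2a.

```python
import random

def success_rate(known, predict, f_u, seed=0):
    """Hide a fraction f_u of the classified proteins and score top-ranked predictions.

    known:   dict protein -> set of true functions
    predict: callable (protein, visible_annotations) -> top-ranked function or None
    """
    rng = random.Random(seed)
    proteins = sorted(known)
    n_hidden = max(1, round(f_u * len(proteins)))
    hidden = set(rng.sample(proteins, n_hidden))
    # The predictor only sees the annotations that were not hidden.
    visible = {p: fs for p, fs in known.items() if p not in hidden}
    hits = 0
    for protein in hidden:
        guess = predict(protein, visible)
        # A prediction is successful if it is one of the protein's real functions.
        if guess is not None and guess in known[protein]:
            hits += 1
    return hits / len(hidden)

# Toy annotations (hypothetical) and a trivial predictor that always answers 1.
known = {"A": {1}, "B": {1, 2}, "C": {3}, "D": {1}, "E": {2}}
always_one = lambda protein, visible: 1

print(success_rate(known, always_one, f_u=1.0))
```

Any predictor with the same call signature, for instance one built on the majority rule or on repeated global optimization, can be passed as `predict`.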

A second test concerns the presence of errors in the topology of the protein network. It is known that protein-protein interaction data obtained from two-hybrid experiments contain a number of false positives and false negatives, and these could in principle appreciably alter the quality of predictions by giving the network a spurious connectivity (false or missing edges). The effect of this uncertainty can be modeled by rewiring a certain fraction of the protein interactions. More precisely, with probability q, each reported interaction is ignored and a new interaction is randomly drawn between two proteins that do not interact according to the available data. In this way we obtain a new network with a degree of dissimilarity from the original one that depends on q. The degree of dissimilarity $\mu$ is measured as the percentage of edges between protein pairs that differ between the two networks, the original and the scrambled one. Note that moving one link generally affects the connectivity pattern of four vertices, and that $\mu$ has a non-trivial dependence on q. We applied our method to the modified network, determining a new list of putative functions for each unclassified protein, together with the relative probability (or frequency) of occurrence of each putative function. For convenience we take these lists (one for each unclassified protein) to contain all possible functions, and associate zero probability with those functions that never occurred when the method was applied. We call $P_{is}(\mu)$ the probability that the unclassified protein i belongs to the functional class s in the network with degree of dissimilarity $\mu$ from the original one. The case $P_{is}(0)$ then corresponds to the functional classification obtained using the original network. A quantitative comparison with the predictions made using the original network is provided by the overlap function $\Theta_i(\mu)$, defined as $\Theta_i(\mu) = \sum_s \sqrt{P_{is}(0)\,P_{is}(\mu)}$, which equals 1 when $P_{is}(\mu) = P_{is}(0)$ for all s. We computed the average of $\Theta_i(\mu)$ restricted to unclassified proteins with k interacting partners, and found that it does not vary much with the node degree. In Fig. 2b we plot the average of $\Theta_i(\mu)$ over all unclassified proteins as a function of $\mu$. For 10% dissimilarity, the overlap is around 0.85. Since each displaced edge corresponds to three to four proteins with different interactions, this shows that even if about 30-40% of proteins have at least one misplaced interaction due to erroneous experimental results, an effective evaluation of the proteins' functions is not precluded. Of course, larger levels of error lower the overlap, signaling that the two networks provide rather different configurations of functional assignment.

FIG. 2: a) Self-consistent test. The success rate of our method after a fraction $f_u$ of classified proteins has been set unclassified. Each point represents the probability that the functional classification of proteins with k interacting partners, defined here as the top-ranked function by occurrence in the list of putative functions generated by our method, coincides with their real classification. We report the success rate for $f_u = 0.4$ and $f_u = 0.7$ in the upper and lower curves, respectively. The prediction quality for poorly connected nodes (degree 1 and 2) decreases to just 30% and degrades more rapidly than for highly connected proteins. This is a consequence of the fact that these proteins occupy a very marginal position, where the method cannot take full advantage of the global connectivity properties of the graph. In the inset we report the data for $f_u \to 0$, i.e., when only a single protein is set unclassified. In this case the method gives predictions of very good statistical reliability even for poorly connected proteins. b) Tolerance to errors. The inset shows the overlap ($\Theta_i(\mu)$ averaged over all unclassified proteins) between the functional predictions obtained using the original network and another network with a degree of dissimilarity $\mu$, defined as the percentage of edges between protein pairs that differ between the two networks. The figure shows that a moderate number of misplaced interactions does not preclude a reliable function assignment. Of course, larger levels of error lower the overlap, signaling that the two networks provide rather different configurations of functional assignment. The curve shows a decreasing linear trend that, extrapolated to $\mu = 1$, gives a very small value of the overlap (less than 15%), as expected. Still, the extrapolation to $\mu = 1$ is somewhat inappropriate, since complete dissimilarity between the original and the scrambled network is hardly achievable by random rewiring, and it is therefore unjustified in the present context.
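
The rewiring procedure and the overlap function can be sketched in Python as follows. This is a hedged illustration, not the code used for Fig. 2b: the names `rewire`, `dissimilarity` and `overlap` are introduced here, and the dissimilarity is computed under the assumption that it is the fraction of original edges absent from the scrambled network, which is one possible reading of the definition in the text.

```python
import math
import random

def rewire(edges, nodes, q, seed=0):
    """With probability q, drop each edge and draw a new one between a non-interacting pair."""
    rng = random.Random(seed)
    original = {frozenset(e) for e in edges}
    current = set(original)
    for edge in sorted(original, key=sorted):
        if rng.random() < q:
            current.discard(edge)
            while True:
                candidate = frozenset(rng.sample(nodes, 2))
                # The new edge must join proteins that do not interact in the
                # available data and must not duplicate an edge already drawn.
                if candidate not in original and candidate not in current:
                    current.add(candidate)
                    break
    return current

def dissimilarity(original_edges, rewired_edges):
    """Fraction of original edges missing from the rewired network (assumed definition)."""
    original = {frozenset(e) for e in original_edges}
    rewired = {frozenset(e) for e in rewired_edges}
    return len(original - rewired) / len(original)

def overlap(p_original, p_rewired):
    """Sum over s of sqrt(P_is(0) * P_is(mu)); functions that never occurred count as 0."""
    return sum(math.sqrt(p * p_rewired.get(s, 0.0)) for s, p in p_original.items())

# Toy chain network (hypothetical) on six proteins.
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]
nodes = list("ABCDEF")
scrambled = rewire(edges, nodes, q=0.5)
print(dissimilarity(edges, scrambled))
print(overlap({1: 0.5, 2: 0.5}, {1: 0.5, 2: 0.5}))
```

The overlap is computed per protein from its two frequency lists; averaging it over all unclassified proteins gives the quantity plotted against the dissimilarity.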

The method we propose appears to be a general tool for the assignment of protein function, and it shows that protein-protein interaction data can be an effective framework for deducing the function of unclassified proteins. The method also allows multiple functions to be determined and takes into account, self-consistently, the effect of unclassified proteins on the final assignment configuration. Finally, the validity tests performed show that the method tolerates the inherent imperfection and incompleteness of protein networks.

Experimental protocol 

To each unclassified protein i = 1, 2, ..., n a function $\sigma_i$ is assigned, chosen among the F possible ones, so as to globally minimize the following score function:

$$E = -\sum_{i,j} J_{ij}\,\delta(\sigma_i, \sigma_j) - \sum_i h_i(\sigma_i), \qquad (1)$$

where $J_{ij}$ is the adjacency matrix of the interaction network for the unclassified proteins ($J_{ij}$ is equal to 1 if proteins i and j interact and are both unclassified, and 0 otherwise), $\delta(\sigma_i, \sigma_j)$ is the discrete delta function, and $h_i(\sigma_i)$ is the number of classified partners of protein i with function $\sigma_i$.
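
To make the score function concrete, the following sketch evaluates equation (1) on a toy network and enumerates all configurations by brute force. The names `score` and `ground_states`, the data layout, and the toy network are assumptions made for this example. The sketch counts each unclassified pair once, which may differ from the double sum in equation (1) by a constant factor on the first term.

```python
from itertools import product

def score(assignment, unclassified_edges, classified_partners):
    """Score E for one configuration.

    assignment:          dict unclassified protein -> assigned function
    unclassified_edges:  (i, j) pairs with both proteins unclassified, each pair once
    classified_partners: dict protein -> list of function sets, one per classified partner
    """
    energy = 0
    for i, j in unclassified_edges:
        if assignment[i] == assignment[j]:
            energy -= 1  # J_ij * delta(sigma_i, sigma_j)
    for i, function in assignment.items():
        # h_i(sigma_i): classified partners of i that carry the assigned function
        energy -= sum(function in fs for fs in classified_partners.get(i, ()))
    return energy

def ground_states(proteins, functions, unclassified_edges, classified_partners):
    """Enumerate all F^n configurations and return the minimal score and its configurations."""
    best, minima = None, []
    for values in product(functions, repeat=len(proteins)):
        assignment = dict(zip(proteins, values))
        e = score(assignment, unclassified_edges, classified_partners)
        if best is None or e < best:
            best, minima = e, [assignment]
        elif e == best:
            minima.append(assignment)
    return best, minima

# Toy frustrated case (hypothetical): U1 and U2 interact, but their classified
# partners carry different functions, so not all constraints can be satisfied.
best, minima = ground_states(
    ["U1", "U2"], [1, 2], [("U1", "U2")], {"U1": [{1}], "U2": [{2}]}
)
print(best, minima)
```

In this toy case three of the four configurations reach the same minimal score, a small instance of the multiplicity of equivalent solutions discussed in the main text. Exhaustive enumeration is feasible only for very small networks.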



The majority rule of [5,6] corresponds to minimizing only the second term on the r.h.s. of equation (1). This can be achieved with local methods (i.e., by considering each protein successively and independently). In our method, the contribution to the total score of assigning a protein i to functional class $\sigma_i$ also depends on the assignments made to all other proteins, resulting in a much harder computational task. The advantage is that the underlying requirement that interacting proteins share a common function is also applied to interactions between previously unclassified proteins, which are completely ignored by the majority rule approach.



To overcome the computational difficulties and find the configuration or configurations that minimize E, we perform simulated annealing [16], introducing an effective temperature T. We start with a random initial configuration $\{\sigma_i\}$. Then, at each Monte Carlo step, we select one protein at random and change its state from $\sigma_i$ to $\sigma_i'$, where $\sigma_i'$ is selected at random among the possible states of protein i with the constraint $\sigma_i' \neq \sigma_i$. We then compute the score difference $\Delta E = E' - E$ between these two configurations. If $\Delta E \le 0$ we accept the new configuration. If $\Delta E > 0$, we accept the new configuration with probability $r = \exp(-\Delta E/T)$ or keep the original configuration with probability $1 - r$. This Monte Carlo step is repeated until E reaches a stationary value. Thereafter, T is decreased by a small amount. These two processes, equilibration at a given T and decrease of T, are repeated until the protein states remain unchanged for a sufficiently long time. The final protein states are our predicted functional classification. Since the minimum-energy solution is not unique, the simulated annealing process was repeated several times, starting from different initial configurations. Finally, we computed the fraction of times $p_{is}$ that protein i was observed in the final state s, which gives us an estimate of the probability that protein i belongs to the functional class s.
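
A single annealing run along these lines can be sketched as follows. This is a simplified illustration rather than the implementation used here: the function `anneal`, its parameters (`t_start`, `t_end`, `cooling`, `sweeps`) and the toy network are hypothetical, and a fixed geometric cooling schedule with a fixed number of steps per temperature replaces the stationarity checks described above.

```python
import math
import random

def anneal(proteins, functions, neighbors, known,
           t_start=2.0, t_end=0.05, cooling=0.95, sweeps=20, seed=0):
    """One simulated annealing run; returns dict unclassified protein -> function."""
    rng = random.Random(seed)
    state = {p: rng.choice(functions) for p in proteins}  # random initial configuration

    def local_score(p, f):
        # Minus the number of partners of p, classified or not, that carry function f.
        total = 0
        for partner in neighbors[p]:
            if partner in state:
                total += state[partner] == f
            else:
                total += f in known.get(partner, ())
        return -total

    t = t_start
    while t > t_end:
        for _ in range(sweeps * len(proteins)):
            p = rng.choice(proteins)
            new = rng.choice([f for f in functions if f != state[p]])
            # Only terms involving p change, so the score difference is local.
            delta = local_score(p, new) - local_score(p, state[p])
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                state[p] = new
        t *= cooling  # lower the effective temperature
    return state

# Toy network (hypothetical): U1 has two classified partners with function 1.
proteins = ["U1", "U2"]
functions = [1, 2, 3]
neighbors = {"U1": ["A", "B", "U2"], "U2": ["U1"]}
known = {"A": {1}, "B": {1}}
print(anneal(proteins, functions, neighbors, known))
```

Repeating the run with different values of `seed` and counting how often each protein ends in each state gives the frequency estimate described above.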